Welcome everybody, welcome to deep learning. Sorry that I'm only showing up now at this appointment, but I had to travel quite a bit. Now we can go ahead and enjoy the teachings about deep learning. Okay, so what you've seen so far was the introduction, and in the last lecture we had some technical problems and couldn't cover very much. I heard that you got up to slide number 12; is that approximately how far you got before the projector broke down? Yes, no, maybe? Okay, but you've heard a bit about feed-forward neural networks, so let's see. You talked about universal approximation and about classification trees and how to map them onto networks. The gist of universal approximation is that one hidden layer is essentially enough to approximate any continuous function on a compact set, where the compact set is essentially the support of your data distribution. So everything that we're doing works on the same distribution that you have seen in the training data set.
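As a reference, this is the usual formal statement of the theorem, going back to Cybenko and Hornik; the notation here is my own and is not read out in the recording:

```latex
% Universal approximation: for any continuous f on a compact set K and any
% epsilon > 0, one hidden layer with enough units N suffices.
\exists N,\; v_i, b_i \in \mathbb{R},\; \mathbf{w}_i \in \mathbb{R}^d:\quad
\sup_{\mathbf{x} \in K} \Big|\, f(\mathbf{x}) - \sum_{i=1}^{N} v_i\, \sigma(\mathbf{w}_i^{\top}\mathbf{x} + b_i) \,\Big| < \varepsilon
```

Here σ is a fixed non-constant, bounded, continuous activation function, and the theorem only guarantees that such weights exist, not a way to find them.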
But there are certain functions that can be represented better if you construct them not with a single hidden layer, but with, let's say, two layers or even more. Later in the lecture we will hear about exponential feature reuse: the deeper you build your network, the more paths emerge through the network, and many of those features can be reused. So the main point that I want you to understand here is that it's not just about universal approximation. Universal approximation tells us that one layer is enough, but when you stack layers on top of each other, the representational power of the network increases. Here we had an example where, with something like six neurons in a single layer, we couldn't model the function very well, but if we increase to just seven neurons and arrange them in two layers, we are able to model the function accurately, with zero error. And by the way, with two layers any decision tree can be approximated by a neural network without error. The idea is simply that you use the first layer to create the partitions, and with those partitions you can then model every patch in the second layer and assign it a class. So the first layer essentially forms the partitions, and in the second layer you just assign one class to the respective partition, and this works for every decision tree.
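To make that construction concrete, here is a minimal sketch for a toy tree with two splits; the tree, the thresholds, and all variable names are illustrative choices of mine and are not taken from the lecture slides.

```python
import numpy as np

# Toy decision tree:
#   if x1 > 0.5 -> class 1
#   elif x2 > 0.5 -> class 1
#   else -> class 0
# realized as a two-layer network with hard threshold ("step") activations.

def step(z):
    return (z > 0).astype(float)

# Layer 1: each neuron implements one split of the tree (forms the partitions).
W1 = np.array([[1.0, 0.0],     # h1 = step(x1 - 0.5)
               [0.0, 1.0]])    # h2 = step(x2 - 0.5)
b1 = np.array([-0.5, -0.5])

# Layer 2: assign a class to each partition cell (here: class 1 if any split fires).
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-0.5])

def predict(x):
    h = step(W1 @ x + b1)              # which cell of the partition x falls into
    return int(step(W2 @ h + b2)[0])   # class assigned to that cell

print(predict(np.array([0.8, 0.1])))   # 1 (x1 > 0.5)
print(predict(np.array([0.2, 0.9])))   # 1 (x2 > 0.5)
print(predict(np.array([0.2, 0.1])))   # 0 (neither split fires)
```

The first layer's step units reproduce the tree's splits, so the hidden vector identifies the partition cell, and the second layer just reads off the class assigned to that cell.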
The other thing that you should take away from this is that the mechanism we use in neural networks is very powerful. One layer is enough to model any kind of continuous function, and with a second or even deeper layers you can model really complex systems, and this is also what deep learning is about: we start modeling very complex systems in a rather compact form. Okay, so this is essentially the gist of the universal approximation theorem, that we can approximate any continuous function on a compact set. The main problem that we still have now is that we don't know how to actually determine the parameters, right? We just know that there exists a solution, but we don't know how to get there, and this is a fundamental problem that we still need to solve. So far we've only learned that there is potentially a solution, or some good solutions, but not how to actually get there. Now we have to go from the activations to classifications, and this is why we introduce the so-called softmax function.
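The formula itself is not spelled out at this point in the recording, so as a reference, here is the standard definition of the softmax as a small sketch; the variable names are mine.

```python
import numpy as np

# Standard softmax: turn a vector of activations z into class probabilities.
# Subtracting the maximum is the usual numerical-stability trick; it does not
# change the result because softmax is invariant to shifting all inputs.
def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])   # example activations for K = 3 classes
print(softmax(z))               # approx. [0.659, 0.242, 0.099], sums to 1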
So far we had essentially described the ground truth label as y and the estimate as y hat, and we were using the values minus one and one where we had two classes, but this only applies to a two-class problem, right? If you want to go to multiple classes, you somehow have to model them in a different way, so you cannot just take one number as the class output. Instead, we can use a vector. You see that this is bold; we use bold script to indicate vectors, and now we have a set of scalars indexed from one to K, where K is the number of classes. Now we can also have many classes; we just need an output vector that is big enough to cover all of the classes. If you have a hundred classes, then your output vector will have a hundred entries. And then for every index we essentially have a zero if it's not the class and a one if it is the correct class. This is also called a one-hot encoding, so in the ground truth vector, the y, not the y hat, exactly one entry is one and all the other entries are zero.
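As a small sketch of this one-hot encoding (the function name and example values are my own, not from the lecture):

```python
import numpy as np

# One-hot encoding of a ground-truth label: an index k in {0, ..., K-1}
# becomes a length-K vector with a one at position k and zeros elsewhere.
def one_hot(label, num_classes):
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(2, 5))   # [0. 0. 1. 0. 0.]  -> sample belongs to class 2 of 5
```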